{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "In this lab we'll demonstrate several common techniques and helpful tools used in a model building process:\n",
    "\n",
    "- Use Sklearn to generate polynomial features and rescale them\n",
    "- Create folds for cross-validation\n",
    "- Perform a grid search to optimize hyper-parameters using cross-validation\n",
    "- Create pipelines to perform grids search in less code\n",
    "- Improve upon a baseline model incrementally by adding in more complexity\n",
    "\n",
    "This lab will require using several Sklearn classes. It would be helpful to refer to appropriate documentation:\n",
    "- http://scikit-learn.org/stable/modules/preprocessing.html\n",
    "- http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler\n",
    "- http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html#sklearn.preprocessing.PolynomialFeatures\n",
    "- http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html\n",
    "- http://scikit-learn.org/stable/modules/pipeline.html#pipeline\n",
    "\n",
    "Also, here is a helpful tutorial that explains how to use much of the above:\n",
    "- https://civisanalytics.com/blog/data-science/2016/01/06/workflows-python-using-pipeline-gridsearchcv-for-compact-code/\n",
    "\n",
    "Like always, let's first load in the data.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import os\n",
    "import pandas as pd\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.model_selection import GridSearchCV\n",
    "from sklearn.model_selection import KFold\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline\n",
    "\n",
    "cwd = os.getcwd()\n",
    "datadir = '/'.join(cwd.split('/')[0:-1]) + '/data/'\n",
    "\n",
    "data = pd.read_csv(datadir + 'Cell2Cell_data.csv', header=0, sep=',')\n",
    "\n",
    "#Randomly sort the data:\n",
    "data = data.sample(frac = 1)\n",
    "data.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next we're going to prep the data. From prior analysis (Churn Case Study) we learned that we can drop a few variables, as they are either highly redundant or don't carry a strong relationship with the outcome.\n",
    "\n",
    "After dropping, we're going to use the SkLearn KFold class to set up cross validation fold indexes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#Prior analysis (from Churn Case study) has shown that we can drop a few redundant variables\n",
    "#We want to drop a few to speed up later calculations\n",
    "dropvar_list = ['incalls', 'creditcd', 'marryyes', 'travel', 'pcown']\n",
    "data_subset = data.drop(dropvar_list, 1)\n",
    "\n",
    "#Set up X and Y\n",
    "X = data_subset.drop('churndep', 1)\n",
    "Y = data_subset['churndep']\n",
    "\n",
    "#Use Kfold to create 4 folds\n",
    "kfolds = KFold(#Student -nput code here)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next let's use cross-validation to build a baseline model. We're going to use LR with no feature pre-processing. We're going to look at both L1 and L2 regularization with different weights. We can do this very succinctly with SkLearns GridSearchCV package."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#1st, set up a paramater grid\n",
    "param_grid_lr = {'C':#Student put code here, \n",
    "                 'penalty':#Student put code here}\n",
    "\n",
    "#2nd, call the GridSearchCV class, use LogisticRegression and 'roc_auc' for scoring\n",
    "lr_grid_search = GridSearchCV(#Student put code here) \n",
    "lr_grid_search.fit(X, Y)\n",
    "\n",
    "#3rd, get the score of the best model and print it\n",
    "best_1 = #Student put code here\n",
    "print(best_1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#Next let's look at the best-estimator chosen to see what the parameters were\n",
    "lr_grid_search.#Student put code here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's see if we can beat this by standardizing the features. We'll approach this using the GridSearchCV class but also build a pipeline. Later we'll extend the pipeline to allow for feature engineering as well."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "#Create a set of steps. All but the last step is a transformer (something that processes data). \n",
    "#Build a list of steps, where the first is StandardScaler and the second is LogisticRegression\n",
    "#The last step should be an estimator.\n",
    "\n",
    "steps = [('scaler', #Student put code here,\n",
    "         ('lr', #Student put code here)]\n",
    "\n",
    "#Now set up the pipeline\n",
    "pipeline = Pipeline(#Student put code here)\n",
    "\n",
    "#Now set up the parameter grid, paying close to the correct convention here\n",
    "parameters_scaler = dict(lr__C = #Student put code here,\n",
    "                         lr__penalty = #Student put code here)\n",
    "\n",
    "#Now run another grid search\n",
    "lr_grid_search_scaler = GridSearchCV(#Student put code here)\n",
    "                        \n",
    "#Don't forget to fit this GridSearchCV pipeline\n",
    "#Student put code here\n",
    "\n",
    "#Again, print the score of the best model\n",
    "best_2 = #Student put code here\n",
    "print(best_2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#Let's see the model after scaling. Did the optimal parameters change?\n",
    "lr_grid_search_scaler.best_estimator_.steps[-1][1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we've built a pipeline estimator that performs feature scaling and then logistic regression, let's add to it a feature engineering step. We'll then again use GridSearchCV to find an optimal parameter configuration and see if we can beat our best score above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import PolynomialFeatures\n",
    "\n",
    "#Create a set of steps. All but the last step is a transformer (something that processes data). \n",
    "# Step 1 - PolynomialFeatures\n",
    "# Step 2 - StandardScaler\n",
    "# Step 3 - LogisticRegression\n",
    "\n",
    "steps_poly = [#Student put code here]\n",
    "\n",
    "#Now set up the pipeline\n",
    "pipeline_poly = #Student put code here\n",
    "\n",
    "#Now set up a new parameter grid, use the same paramaters used above for logistic regression, \n",
    "#but add polynomial features up to degree 2 with and without interactions. \n",
    "parameters_poly = dict(#Student put code here)\n",
    "\n",
    "#Now run another grid search\n",
    "lr_grid_search_poly = #Student put code here\n",
    "lr_grid_search_poly.fit(X, Y)\n",
    "\n",
    "best_3 = lr_grid_search_poly.best_score_\n",
    "print(best_3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#Let's look at the best estimator, stepwise\n",
    "lr_grid_search_poly.best_estimator_.steps"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now make a bar chart to plot results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "results = np.array([best_1, best_2, best_3])\n",
    "labs = ['LR', 'Scaler-LR', 'Poly-Scaler-LR']\n",
    "\n",
    "fig = plt.figure(facecolor = 'w', figsize = (12, 6))\n",
    "ax = plt.subplot(111)\n",
    "\n",
    "width = 0.5\n",
    "ind = np.arange(3)\n",
    "rec = ax.bar(ind + width, results, width, color='r')\n",
    "\n",
    "ax.set_xticks(ind + width)\n",
    "ax.set_xticklabels(labs, size = 14)\n",
    "ax.set_ylim([0.6, max(results)*1.1])\n",
    "\n",
    "plt.plot(np.arange(4), max(results) * np.ones(4))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
